Description: Explore the diverse demographics of Saudi Arabia through this comprehensive dataset showcasing population statistics across various parameters. The dataset contains records detailing the population dynamics in terms of gender, nationality, region, and year.
Columns:
Key Insights:
Why Explore this Dataset?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
df = pd.read_csv('KSA_population.csv')
df.head()
| | Gender | Nationality | Region | Year | Population |
|---|---|---|---|---|---|
| 0 | Female | Non Saudi | Al Bahah | 2010 | 16209 |
| 1 | Female | Non Saudi | Al Bahah | 2011 | 16521 |
| 2 | Female | Non Saudi | Al Bahah | 2012 | 16752 |
| 3 | Female | Non Saudi | Al Bahah | 2013 | 17508 |
| 4 | Female | Non Saudi | Al Bahah | 2014 | 17682 |
# Step 1: Data shape
print(df.shape)
rows, columns = df.shape
print(f"Num of Rows: {rows} ") # instances
print(f"Num of Columns: {columns} ") # series
print(f"The size (rows x columns) is: {df.size}") # size
print(f"The Dimensions are: {df.ndim}") # dimensions
(676, 5)
Num of Rows: 676
Num of Columns: 5
The size (rows x columns) is: 3380
The Dimensions are: 2
df.columns
Index(['Gender', 'Nationality', 'Region', 'Year', 'Population'], dtype='object')
df.nunique()
Gender 2
Nationality 2
Region 13
Year 13
Population 675
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 676 entries, 0 to 675
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Gender       676 non-null    object
 1   Nationality  676 non-null    object
 2   Region       676 non-null    object
 3   Year         676 non-null    int64
 4   Population   676 non-null    int64
dtypes: int64(2), object(3)
memory usage: 26.5+ KB
We can see that the dataset contains a mixture of categorical and numerical variables.
Categorical variables have data type object.
Numerical variables have data type int64.
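The split between categorical and numerical columns can also be made programmatically. The following is a minimal sketch on a small hypothetical frame (`sample` is an illustration, not the real file) that uses `select_dtypes` to separate the two groups.

```python
import pandas as pd

# Hypothetical miniature frame that mirrors the dataset's column layout.
sample = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "Nationality": ["Saudi", "Non Saudi"],
    "Region": ["Al Bahah", "Tabuk"],
    "Year": [2010, 2011],
    "Population": [16209, 16521],
})

# Numeric columns are selected directly; everything else is treated as categorical.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```

Selecting by exclusion avoids depending on how a given pandas version labels text columns.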
df.describe()
| | Year | Population |
|---|---|---|
| count | 676.000000 | 6.760000e+02 |
| mean | 2016.000000 | 5.587169e+05 |
| std | 3.744428 | 7.062714e+05 |
| min | 2010.000000 | 1.430400e+04 |
| 25% | 2013.000000 | 1.098975e+05 |
| 50% | 2016.000000 | 2.495590e+05 |
| 75% | 2019.000000 | 6.180745e+05 |
| max | 2022.000000 | 3.406281e+06 |
The above command df.describe() helps us to view the statistical properties of numerical variables. It excludes character variables.
If we want to view the statistical properties of character variables, we should run the following command -
df.describe(include=['object'])
If we want to view the statistical properties of all the variables, we should run the following command -
df.describe(include='all')
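For a text column, describe() reports count, unique, top, and freq instead of mean and quartiles. A minimal sketch on a hypothetical series (`regions` is made up for illustration):

```python
import pandas as pd

# Hypothetical text column with one dominant label.
regions = pd.Series(["Al Bahah", "Al Bahah", "Tabuk", "Al Bahah"], name="Region")

# For non-numeric data, describe() returns count, unique, top and freq.
summary = regions.describe()
print(summary)
```

Here `top` is the most frequent label and `freq` is how many times it occurs.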
Gender Variable
df['Gender'].unique()
array(['Female', 'Male'], dtype=object)
We can see that the Gender variable has 2 unique values.
The two unique values are Male and Female.
df['Gender'].value_counts()
Female 338
Male 338
Name: Gender, dtype: int64
f, ax = plt.subplots(figsize=(4, 4))
ax = sns.countplot(x="Gender", data=df, palette="Set1")
plt.show()
Nationality Variable
df['Nationality'].nunique()
2
df['Nationality'].unique()
array(['Non Saudi', 'Saudi'], dtype=object)
We can see that the Nationality variable has 2 unique values.
The two unique values are Non Saudi and Saudi.
df['Nationality'].value_counts()
Non Saudi 338
Saudi 338
Name: Nationality, dtype: int64
f, ax = plt.subplots(figsize=(8, 4))
ax = sns.countplot(y="Nationality", data=df, palette="Set1")
plt.show()
Region Variable
df['Region'].nunique()
13
df['Region'].unique()
array(['Al Bahah', 'Al Jawf', 'Al Madinah Al Munawwarah', 'Al Qaseem',
'Ar Riyadh', 'Aseer', 'Eastern Region', 'Hail', 'Jazan',
'Makkah Al Mukarramah', 'Najran', 'Northern Borders', 'Tabuk'],
dtype=object)
We can see that the Region variable has 13 unique values.
The 13 unique values are Al Bahah, Al Jawf, Al Madinah Al Munawwarah, Al Qaseem, Ar Riyadh, Aseer, Eastern Region, Hail, Jazan, Makkah Al Mukarramah, Najran, Northern Borders, and Tabuk.
f, ax = plt.subplots(figsize=(10, 4))
ax = sns.countplot(y="Region", data=df, palette="Set2")
plt.show()
Year Variable
df['Year'].nunique()
13
Year is a numeric variable.
df['Year'].describe()
count 676.000000
mean 2016.000000
std 3.744428
min 2010.000000
25% 2013.000000
50% 2016.000000
75% 2019.000000
max 2022.000000
Name: Year, dtype: float64
#boxplot
sns.boxplot(data=df, y='Year')
plt.ylabel("Years")
plt.title("Box Plot of Year Variable")
plt.show()
#boxenplot
sns.boxenplot(data=df, y='Year')
<AxesSubplot:ylabel='Year'>
#violinplot
# Create the figure before plotting; calling plt.figure() afterwards only opens an empty figure.
plt.figure(figsize=(6,8))
sns.violinplot(data=df, y='Year')
plt.show()
Boxplot, boxenplot, and violinplot each show the distribution of a numeric column.
Population Variable
df['Population'].describe()
count 6.760000e+02
mean 5.587169e+05
std 7.062714e+05
min 1.430400e+04
25% 1.098975e+05
50% 2.495590e+05
75% 6.180745e+05
max 3.406281e+06
Name: Population, dtype: float64
#Histogram
sns.histplot(data=df, x='Population', kde=True, bins= 5)
plt.xlabel("Population")
plt.ylabel("Frequency")
plt.title("Histogram of Population Variable")
plt.show()
#boxenplot
sns.boxenplot(data=df, x='Population')
<AxesSubplot:xlabel='Population'>
#violinplot
sns.violinplot(data=df, x='Population')
<AxesSubplot:xlabel='Population'>
#boxplot, with the y-axis capped at 1,000,000 to zoom in on the bulk of the data
sns.boxplot(data=df, y='Population')
plt.ylim(top= 1000000)
plt.ylabel("Population")
plt.title("Box Plot of Population Variable")
plt.show()
# Checking skewness of the data
df['Population'].skew()
1.8921844651674402
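As a rough guide to reading a skewness value, a symmetric sample gives a value near 0, while a sample with a long right tail gives a positive value. The sketch below uses two small hypothetical series, not the dataset itself.

```python
import pandas as pd

# Hypothetical symmetric sample: values are evenly spread around the mean.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Hypothetical right-skewed sample: one very large value stretches the right tail.
right_skewed = pd.Series([10.0, 12.0, 15.0, 20.0, 30.0, 60.0, 400.0])

symmetric_skew = symmetric.skew()
right_skew = right_skewed.skew()

print("Symmetric sample skewness:", symmetric_skew)
print("Right-skewed sample skewness:", right_skew)
```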
The skewness of the Population column is about 1.89, which indicates a right-skewed (positively skewed) distribution.
categorical = [var for var in df.columns if df[var].dtype=='O']
print('There are {} categorical variables\n'.format(len(categorical)))
print('The categorical variables are :', categorical)
There are 3 categorical variables

The categorical variables are : ['Gender', 'Nationality', 'Region']
# Lets take a look
df[categorical].head()
| | Gender | Nationality | Region |
|---|---|---|---|
| 0 | Female | Non Saudi | Al Bahah |
| 1 | Female | Non Saudi | Al Bahah |
| 2 | Female | Non Saudi | Al Bahah |
| 3 | Female | Non Saudi | Al Bahah |
| 4 | Female | Non Saudi | Al Bahah |
The number of labels within a categorical variable is known as cardinality. A high number of labels within a variable is known as high cardinality. High cardinality may pose some serious problems in the machine learning model. So, I will check for high cardinality.
# check for cardinality in categorical variables
for var in categorical:
    print(var, ' contains ', len(df[var].unique()), ' labels')
Gender contains 2 labels
Nationality contains 2 labels
Region contains 13 labels
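The same cardinality check can be written without a loop. The sketch below is an illustration under assumptions: `sample` is a hypothetical frame built from the region names listed earlier, and the threshold of 10 labels is an arbitrary cutoff, not a rule from this notebook.

```python
import pandas as pd

regions = [
    "Al Bahah", "Al Jawf", "Al Madinah Al Munawwarah", "Al Qaseem",
    "Ar Riyadh", "Aseer", "Eastern Region", "Hail", "Jazan",
    "Makkah Al Mukarramah", "Najran", "Northern Borders", "Tabuk",
]

# Hypothetical frame: one row per region, alternating gender and nationality.
sample = pd.DataFrame({
    "Gender": ["Female", "Male"] * 6 + ["Female"],
    "Nationality": ["Saudi", "Non Saudi"] * 6 + ["Saudi"],
    "Region": regions,
})

# nunique() counts distinct labels per column in a single call.
cardinality = sample.nunique()

# Arbitrary illustrative cutoff for what counts as "high" cardinality.
threshold = 10
high_cardinality = cardinality[cardinality > threshold].index.tolist()

print(cardinality)
print("High-cardinality columns:", high_cardinality)
```

What counts as high cardinality depends on the model and the number of rows, so the cutoff should be chosen per project.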
df[categorical].isnull().sum()
Gender 0
Nationality 0
Region 0
dtype: int64
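When a dataset does contain gaps, the share of missing values per column is often more informative than the raw count. A minimal sketch on a hypothetical frame with one deliberately missing entry:

```python
import pandas as pd

# Hypothetical frame with a single missing Gender value.
sample = pd.DataFrame({
    "Gender": ["Female", "Male", None, "Male"],
    "Region": ["Al Bahah", "Tabuk", "Hail", "Jazan"],
})

# Count of missing values per column.
missing_counts = sample.isnull().sum()

# isnull() yields booleans, so the column mean is the missing fraction.
missing_percent = sample.isnull().mean() * 100

print(missing_counts)
print(missing_percent)
```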
plt.rcParams['figure.figsize'] = (15,6)
sns.heatmap(df.isnull(),yticklabels = False, cbar = False , cmap = 'viridis')
plt.title("Missing values")
Text(0.5, 1.0, 'Missing values')
import seaborn as sns
# make sure that the variable 'categorical' is defined in a previous cell
sns.pairplot(df, hue= 'Gender')
<seaborn.axisgrid.PairGrid at 0x22c6b331490>
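# Restrict to numeric columns; recent pandas versions raise an error on text columns otherwise.
correlation = df.corr(numeric_only=True)
plt.figure(figsize=(10,5))
plt.title('Correlation Heatmap of KSA Population Dataset')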
ax = sns.heatmap(correlation, square=True, annot=True, fmt='.2f', linecolor='white')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=30)
plt.show()
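A correlation matrix is only defined for numeric columns. The sketch below assumes pandas 1.5 or later, where `corr` accepts `numeric_only`, and uses a small hypothetical frame to show that text columns are left out of the result.

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns.
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Year": [2010, 2011, 2012, 2013],
    "Population": [100, 110, 125, 130],
})

# numeric_only=True drops the text column before computing Pearson correlations.
correlation = sample.corr(numeric_only=True)

print(correlation)
```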
Population and Year are positively correlated.
df.groupby('Year')['Population'].sum().sort_values(ascending= False).plot(kind= 'bar')
plt.get_backend()
'module://matplotlib_inline.backend_inline'
Total_population_by_year
# Creating column('Total_population_by_year')
df['Total_population_by_year'] = df.groupby('Year')['Population'].transform('sum')
df.head()
| | Gender | Nationality | Region | Year | Population | Total_population_by_year |
|---|---|---|---|---|---|---|
| 0 | Female | Non Saudi | Al Bahah | 2010 | 16209 | 23978487 |
| 1 | Female | Non Saudi | Al Bahah | 2011 | 16521 | 25091867 |
| 2 | Female | Non Saudi | Al Bahah | 2012 | 16752 | 26168861 |
| 3 | Female | Non Saudi | Al Bahah | 2013 | 17508 | 27624004 |
| 4 | Female | Non Saudi | Al Bahah | 2014 | 17682 | 28309273 |
# Converting 'total_population_by_year' into millions
df['Total_population_by_year']= round(df['Total_population_by_year'] / 1e6, 1)
df.head()
| | Gender | Nationality | Region | Year | Population | Total_population_by_year |
|---|---|---|---|---|---|---|
| 0 | Female | Non Saudi | Al Bahah | 2010 | 16209 | 24.0 |
| 1 | Female | Non Saudi | Al Bahah | 2011 | 16521 | 25.1 |
| 2 | Female | Non Saudi | Al Bahah | 2012 | 16752 | 26.2 |
| 3 | Female | Non Saudi | Al Bahah | 2013 | 17508 | 27.6 |
| 4 | Female | Non Saudi | Al Bahah | 2014 | 17682 | 28.3 |
# Renaming the column
df.rename(columns= {"Total_population_by_year": "Year_population(million)"}, inplace= True)
df.head()
| | Gender | Nationality | Region | Year | Population | Year_population(million) |
|---|---|---|---|---|---|---|
| 0 | Female | Non Saudi | Al Bahah | 2010 | 16209 | 24.0 |
| 1 | Female | Non Saudi | Al Bahah | 2011 | 16521 | 25.1 |
| 2 | Female | Non Saudi | Al Bahah | 2012 | 16752 | 26.2 |
| 3 | Female | Non Saudi | Al Bahah | 2013 | 17508 | 27.6 |
| 4 | Female | Non Saudi | Al Bahah | 2014 | 17682 | 28.3 |
Division by 1e6:
The expression df['Total_population_by_year'] / 1e6 divides the values in the 'Total_population_by_year' column by 1,000,000, converting the population counts to millions.
Rounding to One Decimal Place:
The round() function rounds the result of the division to one decimal place, which makes the yearly totals easier to read in millions.
Assignment Back to the Same Column:
The transformed values are assigned back to the 'Total_population_by_year' column, overwriting the original counts with their values in millions. The column is then renamed to 'Year_population(million)' to reflect the new unit.
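The steps described above can be sketched end to end on a small hypothetical frame. The column names `Year_total` and `Year_total_millions` are illustrative, and the unit conversion is kept in a separate column here so the raw total stays available.

```python
import pandas as pd

# Hypothetical rows: two records per year.
sample = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011],
    "Population": [1_200_000, 800_000, 1_300_000, 860_000],
})

# transform('sum') returns one value per original row, so the yearly total
# is broadcast back onto every row of that year.
sample["Year_total"] = sample.groupby("Year")["Population"].transform("sum")

# Convert to millions and round to one decimal place.
sample["Year_total_millions"] = (sample["Year_total"] / 1e6).round(1)

print(sample)
```

Unlike `agg`, which would return one row per year, `transform` keeps the original row count, which is why the result can be stored as a new column.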
sns.barplot(data=df, x='Year', y='Population', hue='Gender',palette='Set1', errorbar= ('ci', 0))
plt.title("Population of Male/Female from 2010-2022")
Text(0.5, 1.0, 'Population of Male/Female from 2010-2022')
df.groupby(df['Gender']=="Male")["Population"].count()
Gender
False 338
True 338
Name: Population, dtype: int64
population_by_year = df.groupby('Year')['Population'].sum()
# Average absolute increase in people per year between the first and last year (not a percentage rate)
average_growth_rate = (population_by_year.iloc[-1] - population_by_year.iloc[0]) / (population_by_year.index[-1] - population_by_year.index[0])
print("Average population increase per year:", average_growth_rate)
Average population increase per year: 683061.4166666666
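The value computed above is an average absolute change in headcount per year. A relative growth rate can be sketched alongside it as a compound annual growth rate (CAGR). This is an illustration on a hypothetical `totals` series, not a result from the dataset.

```python
import pandas as pd

# Hypothetical yearly totals growing by exactly 10% per year.
totals = pd.Series({2010: 1000.0, 2011: 1100.0, 2012: 1210.0})

n_years = totals.index[-1] - totals.index[0]

# Average absolute change per year (same formula as in the cell above).
absolute_change_per_year = (totals.iloc[-1] - totals.iloc[0]) / n_years

# Compound annual growth rate: the constant yearly rate that links first and last value.
cagr = (totals.iloc[-1] / totals.iloc[0]) ** (1 / n_years) - 1

print("Average absolute change per year:", absolute_change_per_year)
print(f"Compound annual growth rate: {cagr:.2%}")
```

The absolute figure is expressed in people per year, while the CAGR is a unitless fraction, so the two answer different questions.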
population_over_years = df.groupby('Year')['Population'].sum()
plt.figure(figsize=(10, 6))
population_over_years.plot(kind='line', marker='o')
plt.title('Population Trend over Years')
plt.xlabel('Year')
plt.ylabel('Population')
plt.grid(True)
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Year', y='Population', hue='Nationality')
plt.title('Population Variation by Nationality Over Years')
plt.xlabel('Year')
plt.ylabel('Population')
plt.legend(title='Nationality', bbox_to_anchor=(1, 1))
plt.show()
px.bar( df.groupby('Region')['Population'].max().sort_values(ascending= False))
px.bar(df.groupby('Gender')['Population'].max().sort_values(ascending= False))
#plotting the data
px.bar(df, x= "Year", y= 'Population', color= 'Nationality')
# The code generates a stacked bar plot from the specified DataFrame and columns, showing how raw population counts
# (not thousands) vary over the years, segmented by nationality.
# The stacked segments of each color add up to the population count for that nationality in a given year.
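Because the frame holds one row per gender and region, it can help to aggregate before plotting so that each bar corresponds to exactly one number. A minimal sketch on hypothetical rows; the resulting `totals` frame could then be passed to a bar plot in place of the raw frame.

```python
import pandas as pd

# Hypothetical rows: several records per year and nationality.
sample = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2011, 2011, 2011],
    "Nationality": ["Saudi", "Saudi", "Non Saudi", "Saudi", "Saudi", "Non Saudi"],
    "Population": [100, 150, 40, 110, 160, 45],
})

# One row per (Year, Nationality) pair with the summed population.
totals = sample.groupby(["Year", "Nationality"], as_index=False)["Population"].sum()

print(totals)
```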
px.bar(df.groupby(df['Region']== 'Jazan')['Population'].mean())
fig = px.bar(df, x='Region', y='Population', title='Population by Region',
labels={'Population': 'Total Population', 'Region': 'Region'},
color='Region')
fig.update_layout(barmode='stack')
fig.show()
Skimpy and Dataprep
import skimpy
from skimpy import skim
skim(df)
skimpy summary
Data Summary
| dataframe | Values |
|---|---|
| Number of rows | 676 |
| Number of columns | 6 |
Data Types
| Column Type | Count |
|---|---|
| string | 3 |
| int32 | 2 |
| float64 | 1 |
number
| column_name | NA | NA % | mean | sd | p0 | p25 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 0 | 2000 | 3.7 | 2000 | 2000 | 2000 | 2000 | ▅▅▅▅▅█ |
| Population | 0 | 0 | 560000 | 710000 | 14000 | 110000 | 620000 | 3400000 | █▁▁▁ |
| Year_population(mill | 0 | 0 | 29 | 2.5 | 24 | 28 | 31 | 32 | ▄▂▂▂██ |
string
| column_name | NA | NA % | words per row | total words |
|---|---|---|---|---|
| Gender | 0 | 0 | 1 | 680 |
| Nationality | 0 | 0 | 1 | 680 |
| Region | 0 | 0 | 1 | 680 |
!pip install dataprep
from dataprep.eda import create_report
create_report(df)
| Number of Variables | 6 |
|---|---|
| Number of Rows | 676 |
| Missing Cells | 0 |
| Missing Cells (%) | 0.0% |
| Duplicate Rows | 0 |
| Duplicate Rows (%) | 0.0% |
| Total Size in Memory | 143.5 KB |
| Average Row Size in Memory | 217.3 B |
| Variable Types | Categorical: 3, Numerical: 3 |
| Population is skewed | Skewed |
|---|---|
| Year_population(million) is skewed | Skewed |
Gender (categorical)
| Approximate Distinct Count | 2 |
|---|---|
| Approximate Unique (%) | 0.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 47320 |
| Mean | 5 |
|---|---|
| Standard Deviation | 1.0007 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 6 |
| 1st row | Female |
|---|---|
| 2nd row | Female |
| 3rd row | Female |
| 4th row | Female |
| 5th row | Female |
| Count | 3380 |
|---|---|
| Lowercase Letter | 2704 |
| Space Separator | 0 |
| Uppercase Letter | 676 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
Nationality (categorical)
| Approximate Distinct Count | 2 |
|---|---|
| Approximate Unique (%) | 0.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 48672 |
| Mean | 7 |
|---|---|
| Standard Deviation | 2.0015 |
| Median | 7 |
| Minimum | 5 |
| Maximum | 9 |
| 1st row | Non Saudi |
|---|---|
| 2nd row | Non Saudi |
| 3rd row | Non Saudi |
| 4th row | Non Saudi |
| 5th row | Non Saudi |
| Count | 4394 |
|---|---|
| Lowercase Letter | 3380 |
| Space Separator | 338 |
| Uppercase Letter | 1014 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
Region (categorical)
| Approximate Distinct Count | 13 |
|---|---|
| Approximate Unique (%) | 1.9% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 50804 |
| Mean | 10.1538 |
|---|---|
| Standard Deviation | 6.142 |
| Median | 8 |
| Minimum | 4 |
| Maximum | 24 |
| 1st row | Al Bahah |
|---|---|
| 2nd row | Al Bahah |
| 3rd row | Al Bahah |
| 4th row | Al Bahah |
| 5th row | Al Bahah |
| Count | 6292 |
|---|---|
| Lowercase Letter | 5044 |
| Space Separator | 572 |
| Uppercase Letter | 1248 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
Year (numerical)
| Approximate Distinct Count | 13 |
|---|---|
| Approximate Unique (%) | 1.9% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 10816 |
| Mean | 2016 |
| Minimum | 2010 |
| Maximum | 2022 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 2010 |
|---|---|
| 5-th Percentile | 2010 |
| Q1 | 2013 |
| Median | 2016 |
| Q3 | 2019 |
| 95-th Percentile | 2022 |
| Maximum | 2022 |
| Range | 12 |
| IQR | 6 |
| Mean | 2016 |
|---|---|
| Standard Deviation | 3.7444 |
| Variance | 14.0207 |
| Sum | 1.3628e+06 |
| Skewness | 0 |
| Kurtosis | -1.2143 |
| Coefficient of Variation | 0.001857 |
Population (numerical)
| Approximate Distinct Count | 675 |
|---|---|
| Approximate Unique (%) | 99.9% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 10816 |
| Mean | 558716.8994 |
| Minimum | 14304 |
| Maximum | 3406281 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 14304 |
|---|---|
| 5-th Percentile | 28743 |
| Q1 | 109897.5 |
| Median | 249559 |
| Q3 | 618074.5 |
| 95-th Percentile | 2.0754e+06 |
| Maximum | 3406281 |
| Range | 3391977 |
| IQR | 508177 |
| Mean | 558716.8994 |
|---|---|
| Standard Deviation | 706271.3702 |
| Variance | 4.9882e+11 |
| Sum | 3.7769e+08 |
| Skewness | 1.888 |
| Kurtosis | 2.9996 |
| Coefficient of Variation | 1.2641 |
Year_population(million) (numerical)
| Approximate Distinct Count | 12 |
|---|---|
| Approximate Unique (%) | 1.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 10816 |
| Mean | 29.0692 |
| Minimum | 24 |
| Maximum | 32.2 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 24 |
|---|---|
| 5-th Percentile | 24 |
| Q1 | 27.6 |
| Median | 30.1 |
| Q3 | 31 |
| 95-th Percentile | 32.2 |
| Maximum | 32.2 |
| Range | 8.2 |
| IQR | 3.4 |
| Mean | 29.0692 |
|---|---|
| Standard Deviation | 2.5129 |
| Variance | 6.3145 |
| Sum | 19650.8 |
| Skewness | -0.7319 |
| Kurtosis | -0.7589 |
| Coefficient of Variation | 0.08644 |